Getting started

Ben Whalley, Paul Sharpe, Sonja Heintz, Andy Wills

Overview

This session doesn’t assume any prior knowledge of R, and introduces the basics. For BSc students this will include some revision of material from stage 1. However, we provide additional explanation and extension material so that students can test their knowledge and extend familiar skills. We find most students benefit from refreshing their knowledge at this stage in the course.

Even if you are quite confident when using RStudio please read the worksheets carefully and complete all of the activities in the blue boxes.

Using the RStudio interface

  • Access RStudio at https://rstudio.plymouth.ac.uk
  • Use the latest version of the Chrome web browser
  • Tell R what to do in the Console pane
  • See the Environment pane for stored data
  • Use the Files pane to open code and data from a folder on the server

No code in this video!

If you’re using Windows or an older Mac we strongly recommend downloading Chrome and using that. If you have any issues with RStudio this is likely the first suggestion we will make.

When you login to RStudio, you’ll be greeted with a screen that looks something like the image below.

RStudio on first opening

You can see three parts:

  1. The Console - This is the large rectangle on the left. It is where you tell R what to do, and where R prints the answers to your questions.

  2. The Environment - This is the rectangle on the top right. It is where R keeps a list of the data it knows about. It’s empty at the moment, because we haven’t given R any data yet.

  3. The Files - This is the rectangle on the bottom right. It’s a bit like the File Explorer in Windows, or the Finder on a Mac. It shows you what files and folders R can see.

You should also be able to see that the two rectangles on the right have a number of other “tabs”. These work like tabs on a web browser.

The top rectangle has the tabs Environment and History. The History tab keeps a record of what you’ve recently typed into the Console. This can sometimes be useful.

The bottom rectangle has the tabs Files, Plots, Packages, Help, and Viewer. We’ll cover what these other tabs do later on.

Before you start

  • Before starting you must run some R code to get set up.
  • See the code tab or the exercise below.
# run an R script over the internet which will get you
# set up, and copy files you need to your home folder
source("https://raw.githubusercontent.com/benwhalley/lifesavR/main/bootstrap.R")

To get everyone off to the same start we have created a script that copies some files into your home folder on the RStudio server.

To run this script, we just copy and paste the following line into the Console:

source("https://raw.githubusercontent.com/benwhalley/lifesavR/main/bootstrap.R")
  1. Click on the Console pane.
  2. Copy-paste the following into the console:

source("https://raw.githubusercontent.com/benwhalley/lifesavR/main/bootstrap.R")

Your console should now look like this:

Press ↩︎ to run the code. If your console looks like the image below, then you are ready to start the session.

Using the workbooks

  • Each session has an associated “workbook” file
  • They end with the file extension ".rmd"
  • These were copied to your home folder by the bootstrap script (above)
  • Use them to complete the exercises in the worksheet

No code shown in this video

Each session has an associated “workbook” file which you will use to complete the exercises in the worksheet. The file you need for this session is called session-1.rmd.

If you click on the file it opens the workbook in a tab of a new pane, called the Source pane. It’s called the Source pane because statements written in the R language are often referred to as ‘R code’, which is shorthand for ‘R source code’. The Source pane allows you to write R code and explore your data.

Click on session-1.rmd in the Files pane.

If you’re able to open this file you are now ready to start the rest of the session.

What can R do?

  • R is a multi-purpose tool
  • It can do simple arithmetic, load data, make plots etc.
  • It can also run any statistical analysis you like
  • You need to tell R exactly what to do, by providing precise instructions
  • These instructions (code) provide a reproducible record of your work
# multiply two numbers
2 * 221

# generate some random numbers with a normal distribution
rnorm(10, 0,1)

# histogram plot of random numbers
hist(rnorm(100, 0,1))

R is a computer language for data analysis and visualisation.

RStudio is a user interface to R; it helps you organise your work.

R is a text-based language. You interact with it by typing commands and running (also called ‘executing’) them.

R can do everything from simple arithmetic and plotting to complex data analysis.

For example, you can do simple arithmetic like

2 * 221
[1] 442

We could generate some random numbers with a normal distribution

rnorm(10, 0,1)
 [1]  0.4065467  0.9944206  0.8557684  0.1971289  0.8343250  0.8467902  1.9541053 -2.1492600  0.9711203  1.1450616

And we could plot random numbers using a histogram

hist(rnorm(100, 0,1))


You should think of R as a robot.

The robot is extremely fast, powerful and tireless; but it’s also literal-minded, and won’t think for itself or take the initiative. You need to tell it exactly what to do, by providing very precise instructions.

The advantage of writing detailed instructions is that you have a detailed, reproducible version of all your analyses.
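As a small illustration of what "reproducible" can mean in practice, the sketch below fixes the starting point of R's random number generator before drawing random numbers. The set.seed() function is part of base R but is not otherwise used in this worksheet, and the seed value 123 is an arbitrary choice made for this example.

```r
# `<-` stores a result under a name so we can use it again later

# fix the starting point of the random number generator
# (123 is an arbitrary value chosen for this illustration)
set.seed(123)
first_draw <- rnorm(5, 0, 1)

# reset to the same seed and draw again
set.seed(123)
second_draw <- rnorm(5, 0, 1)

# check whether the two sets of "random" numbers are exactly the same
identical(first_draw, second_draw)
```

Because the instructions are written down, anyone who runs them should get the same numbers, which is the sense in which code acts as a record of your work.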

Reproducibility is a key topic in psychology and other natural sciences — learning R (or something like it) is an important skill for new psychologists.

Introducing RMarkdown

  • RMarkdown documents combine ‘chunks’ of R code with regular text
  • This means you can keep notes and explanations right next to where you process and analyse data
  • RMarkdown files end with “.rmd” or “.Rmd”
  • Make a new chunk: Ctrl + Alt + I or ⌘ + Alt + I (Mac)

Finding backticks:

So far you have seen some of the RStudio interface. Now we’re going to encounter RMarkdown documents.

RMarkdown documents are a convenient way to use R within RStudio, and they make our lives easier.

RMarkdown is a file format which combines R code (chunks) with regular text. This means RMarkdown can combine data analysis and graphs with explanatory text, which makes it easier for you to work, keep notes, and communicate your work to others.

Once the document is ready there is an extra step called ‘knitting’ in which the R code in your file is run, and the results are interspersed with your text to make a file you can share with others. Typically that’s an html file (i.e. a web page) or sometimes a pdf.

This allows us to make high quality reports, research papers, dissertations or books, and is becoming very popular as part of the general move to encourage reproducibility in psychological research.


Because it’s such a powerful tool, this module provides an early introduction to RMarkdown. We don’t use all its features just yet, though.

For the moment, we’ll only be using Rmd documents as an interactive interface for running code and looking at the results R produces.

Using RMarkdown

A neat feature of .Rmd files is that, when you open them in RStudio, they make it easy to organise and run R code, and see the outputs.

Provided you have already run the bootstrap script, click on the lifesavr folder in the Files pane of RStudio.

  • Click on files then lifesavr, then exercises folders

You’ll notice that some files have the extension .rmd. These are R Markdown files.

  • Highlight file extension by selecting or pointing with mouse

The file extension .rmd (or .Rmd) is important, because this is how RStudio knows that the files contain a mixture of R code and regular text.

If you don’t save files with the right extension, RStudio has trouble knowing what to do with them.

  • Open the simplest-rmarkdown-example.rmd file

Code chunks

RMarkdown makes a distinction between R code and regular, narrative text.

This tells RStudio how to treat each part of your document — whether to display it as text, or format it as code (and give us warnings when there are errors and so on).

This is done by putting the code inside some special characters, called backticks, creating a code chunk.

A chunk is opened using the symbols ```{r}, and closed using the symbols ```. This is what a chunk looks like in RStudio:

A code chunk in the RMarkdown editor


NOTE: The symbols which start and end a chunk are backticks, not single quotes. The difference is quite subtle.

Backticks are on your keyboard here if you’re on Windows:

On windows

Or here if you’re on a Mac:

On a Mac

Running R code inside chunks

In the picture of a chunk above you might have noticed that we had both some code (2+2) and the result from that code was shown beneath the chunk (4!). To show the result of our code, we first need to run it.

There are three ways to run R code within a chunk:

  • Run a whole chunk at once (can include multiple statements)
  • Run one or more related lines, called a statement
  • Select and run just part of a statement

Running the whole chunk

  • Show the green play button in action

Running a statement

The most common case is when we have given R an instruction (written a statement) and we want to run just that new code.

A statement is sometimes one line of code, but it can include multiple lines that are related; we’ll see more of that later.

For now, you can see we have a code chunk here:

# some arithmetic in R

2 + 4 + 8
[1] 14
  1. Show cursor interacting with this code chunk
  2. Place cursor anywhere on the line
  3. Run statement using Ctrl + Enter

You can see that we put the cursor in the middle of the line of numbers that we wanted to add up, then pressed Ctrl + Enter (Cmd + Enter on a Mac).

This runs or executes the whole line, and the result is shown below the code chunk.

Code chunks with multiple statements

We might want to add some extra calculations to our chunk.

# some arithmetic in R

2 + 2 + 8 + 16
[1] 28

42 * 42
[1] 1764

2 +
  4 +
  8 +
  16
[1] 30

Now we have more than one statement in the chunk, including 2 + 2 + 8 + 16 and 42 * 42 (the star means multiply). We can run any of the statements in the same way: by putting the cursor anywhere on the statement and pressing Ctrl/Cmd + Enter.

  • Demonstrate doing this in rstudio show that the cursor can be anywhere on the line and also
  • That if we execute a second statement the result of the first one disappears
  • Also point out that I separated all the statements with a blank line to make it easier for myself to read (but R doesn’t care about this and can work out whether the lines are related by itself)
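To make the point about related lines concrete, here is a minimal sketch of how R decides where a statement ends. The names total and partial are made up for this illustration. A line that ends with + is unfinished, so R keeps reading; a line that is already complete ends the statement.

```r
# a trailing + tells R the statement continues on the next line
total <- 2 +
  4 +
  8
total

# here the first line is already complete, so R treats the
# second line as a separate statement and `partial` stays at 2
partial <- 2
+ 4
partial
```

This is why the multi-line sum in the chunk above puts each + at the end of a line rather than at the start of the next one.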

Exercise 1

  1. Locate the first chunk in session-1.rmd (you find this in the Files pane).
  2. Place your cursor (anywhere) on the line of R code.
  3. Run the code by pressing Ctrl + ↩︎ (Windows, Linux) or ⌘ + ↩︎ (Mac).

You should see the result of the sum appear below the chunk:

[1] 42

Congratulations! You have just run your first line of R. You can also run part of a line by highlighting just the code you want to run, as you’ll see in the next exercise.

Exercise 2

  1. Select (highlight) the last two numbers in the sum.
  2. Run the code.

This adds up two of the three numbers:

Example of running highlighted code

Exercise 3: Making new chunks

  1. Find the instructions for Exercise 3 in your workbook.
  2. Create a new chunk below the instructions.
  3. Inside the chunk, write a line of code which adds together the numbers 9, 4, 55 and 2.
  4. Run the line of code you have written.

The output from the chunk should look like this:

Result from Exercise 3

Packages

  • Loading a package adds functionality to R
  • Some packages (like tidyverse and psydata) also include example datasets
  • To load tidyverse write library(tidyverse)
  • Load tidyverse and psydata before each session

The following R code is used in the video:

# load the tidyverse package
# (this also loads the diamonds example dataset, and some others)
library(tidyverse)
diamonds %>%
  ggplot(aes(carat, price)) +
  geom_point()

By loading ‘packages’, you can add extra functions and datasets to R.

Packages are a powerful feature which allow R to be extended. This means you can run almost any analysis, or make any type of plot.

Packages are loaded using the library() function.

The command library(tidyverse) loads some additional functions and data which will allow us to make a scatter plot.

The tidyverse package is so fundamental to this course that library(tidyverse) is likely to be the first line of R code, in the first chunk, in all your RMarkdown files.

It’s a good idea to start your documents with a chunk which loads any packages you need. This makes it easy to see which have been loaded, and avoids loading them twice which is occasionally a problem.

You also need to remember to actually run the lines of code to load the libraries. Beginners often forget to do this — but it’s an easy error to fix.

  • Demonstrate error when running following plot if tidyverse hasn’t been loaded
library(tidyverse)
diamonds %>%
  ggplot(aes(carat, price)) +
  geom_point()
  1. Load tidyverse
  2. Re-run example plot to show that it now works

If you’ve understood what packages are it should be clear you need to load them first, before doing anything else.

You can’t use the functions provided by tidyverse until you’ve run the command: library(tidyverse). And the data in psydata is not available until after you load that package.

For example, if you tried to produce a scatter plot before loading tidyverse you’d see an error like this in the console pane:

Error in diamonds %>% ggplot(aes(carat, price, colour = clarity)) :
  could not find function "%>%"

This is important to remember: could not find function errors are one of the most common problems that beginners encounter. They normally mean that you have

  1. forgotten to include library(tidyverse) as the first line in your code, or
  2. forgotten to run that line.
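If you would like to see this kind of error without breaking anything, the sketch below deliberately calls a function that does not exist and captures the message R produces. The name not_a_real_function is invented for this example, and tryCatch() is a base R function that goes beyond what this session covers; it is used here only so that the error message is stored rather than stopping the code.

```r
# call a function that does not exist, and keep the error
# message instead of letting the error stop the code
message_text <- tryCatch(
  not_a_real_function(1),
  error = function(e) conditionMessage(e)
)

# show the captured message
message_text
```

The message has the same "could not find function" wording that you would see if you used %>% or ggplot() before running library(tidyverse).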

Datasets

Datasets are like spreadsheets. They have:

  • multiple rows, with one row per observation
  • multiple columns; each column has a name.
  • columns also (sometimes) get called variables; this can be confusing

Where are datasets?

  • R has some built-in datasets as learning examples
  • The psydata package includes datasets used in this course
  • Later on, we will import data from files (e.g. actual spreadsheets)

Exploring and checking data

  • View a whole dataset by typing its name and running it in a code chunk
  • glimpse() shows a list of all the columns, plus a few of the datapoints
  • The Environment pane shows a spreadsheet-like view of the data
# always load the tidyverse first
library(tidyverse)

# the psydata package contains datasets for this course
library(psydata)

# display the `fuel` dataset, by typing the name
# and running this in a code chunk
fuel

# show only the first 6 rows of the `fuel` data
head(fuel)

# shows a list of columns in the `development` dataset
# plus the first few datapoints (as many as will fit)
glimpse(development)

When we say “dataset” or “data”, we mean something like a spreadsheet: in R, datasets contain values organised into columns and rows.

One distinction to make, though, is that by dataset we mean data that has been loaded into R; data files are a different thing.

In R, datasets are normally stored in a container called a data.frame. They can also be stored in a tibble (these are basically the same thing).
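To show what a data.frame looks like from the inside, here is a minimal sketch that builds a tiny one by hand using base R. The pets data is entirely made up for this illustration and is not part of psydata; in this course you will normally load datasets rather than type them in.

```r
# a tiny made-up dataset: one row per pet, one named column per variable
pets <- data.frame(
  name = c("Rex", "Tom", "Polly"),
  legs = c(4, 4, 2)
)

# display the dataset by typing its name
pets

# check what kind of container it is, and its size and column names
class(pets)
nrow(pets)
names(pets)
```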

Packaged datasets

Some datasets are built into R packages as examples for beginners.

For this course, we created a package called psydata which includes the data we need for teaching.

This is installed on the RStudio server. To load it we run:

library(psydata)

We can see from the loading message that one of the datasets is called fuel. This contains data about cars — things like weight, fuel economy, engine size.

Let’s display this data using a new chunk. If we type the word fuel, select this variable name with our cursor, and ‘execute’ it, we can see the data it contains:

fuel
    mpg cyl engine_size power weight gear automatic
1  21.0   6        2620   110   1188    4      TRUE
2  21.0   6        2620   110   1304    4      TRUE
3  22.8   4        1770    93   1052    4      TRUE
4  21.4   6        4230   110   1458    3     FALSE
5  18.7   8        5900   175   1560    3     FALSE
6  18.1   6        3690   105   1569    3     FALSE
7  14.3   8        5900   245   1619    3     FALSE
8  24.4   4        2400    62   1447    4     FALSE
9  22.8   4        2310    95   1429    4     FALSE
10 19.2   6        2750   123   1560    4     FALSE
11 17.8   6        2750   123   1560    4     FALSE
12 16.4   8        4520   180   1846    3     FALSE
13 17.3   8        4520   180   1692    3     FALSE
14 15.2   8        4520   180   1715    3     FALSE
15 10.4   8        7730   205   2381    3     FALSE
16 10.4   8        7540   215   2460    3     FALSE
17 14.7   8        7210   230   2424    3     FALSE
18 32.4   4        1290    66    998    4      TRUE
19 30.4   4        1240    52    733    4      TRUE
20 33.9   4        1170    65    832    4      TRUE
21 21.5   4        1970    97   1118    3     FALSE
22 15.5   8        5210   150   1597    3     FALSE
23 15.2   8        4980   150   1558    3     FALSE
24 13.3   8        5740   245   1742    3     FALSE
25 19.2   8        6550   175   1744    3     FALSE
26 27.3   4        1290    66    878    4      TRUE
27 26.0   4        1970    91    971    5      TRUE
28 30.4   4        1560   113    686    5      TRUE
29 15.8   8        5750   264   1438    5      TRUE
30 19.7   6        2380   175   1256    5      TRUE
31 15.0   8        4930   335   1619    5      TRUE
32 21.4   4        1980   109   1261    4      TRUE

By default this shows the first ten rows and columns of the data. You can see other rows using the Next, Previous and number buttons below the data.

If your browser window is very narrow you may need to view some of the columns by using the arrow next to the final, right-hand column.

You can get information about the columns in all these example datasets by typing: help(name_of_the_dataset_you_want_to_know_about). For example:

help(fuel)
No documentation for 'fuel' in specified packages and libraries:
you could try '??fuel'

Columns

Each column in a dataset has a name.

We sometimes call the columns variables, because each column will often relate to a variable in our study.

However, this can be a bit confusing because — in R — variables can actually contain whole datasets. fuel, for example, is the name of a variable which contains an example dataset, provided by the psydata package.

  • Show library(psydata) and then the fuel dataset

But these words are used flexibly and interchangeably, so we’ll just have to get used to it. It’s normally clear which type of variable we mean from the context.

Rows

Each row in a dataset represents an observation.

In different datasets an observation might correspond to an individual participant, a whole country, or even just a single button press in an experiment.

Exploring and checking data

There are two ways we recommend to inspect and check the data you are using.

  1. Typing the name of the dataset, and running that as code
  2. The glimpse() function, which shows a list of all the columns and some of the data

To use glimpse:

glimpse(fuel)
Rows: 32
Columns: 7
$ mpg         <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, …
$ cyl         <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, 8, 4, 4, 4, 8, 6, 8, 4
$ engine_size <dbl> 2620, 2620, 1770, 4230, 5900, 3690, 5900, 2400, 2310, 2750, 2750, 4520, 4520, 4520, 7730, 7540, 7210, 1290, 1240, 1170, 1970, …
$ power       <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180, 205, 215, 230, 66, 52, 65, 97, 150, 150, 245, 175, 66, 91, …
$ weight      <dbl> 1188, 1304, 1052, 1458, 1560, 1569, 1619, 1447, 1429, 1560, 1560, 1846, 1692, 1715, 2381, 2460, 2424, 998, 733, 832, 1118, 159…
$ gear        <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5, 5, 4
$ automatic   <lgl> TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE…

glimpse shows a list of all the columns in the dataset, the type of data stored in each column, and as many observations (datapoints) as will fit on a single line.

glimpse is a really useful view to check which columns are available in a dataset before using them.
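For comparison, base R has some functions that give similar information without loading any packages. The sketch below is an aside rather than the approach used in this course; it uses mtcars, a dataset built into R, so that it runs even if psydata is not installed.

```r
# list the column names of a built-in dataset
names(mtcars)

# show each column's type and its first few values
# (a base R view that is similar in spirit to glimpse)
str(mtcars)

# head() shows 6 rows by default; a second number asks for more
first_ten <- head(mtcars, 10)
first_ten
```

The second argument to head() is handy whenever the default of six rows is not enough to find the value you are looking for.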

Why are we talking about cars and flowers and not psychology?

In this course we mostly use very simple datasets, and some of them aren’t even about psychology.

Some students ask why we don’t always use psychological examples. If this hasn’t troubled you then you could skip to the next section, but we thought we should explain:

We think the fuel dataset (and others, like iris, and development) have a number of benefits.

First, they are either built into R, loaded in common packages, or available in the psydata package. This makes them easily available for everyone.

Second, these data relate to concrete, easy to understand phenomena (e.g. weight, length, number of gears). This means you don’t have to hold in mind any complex psychological/theoretical ideas for the examples to make sense.

Third, the relationships in these datasets are clear, and there aren’t too many data points. Real data are often more messy because many psychological constructs are hard to measure.

Our experience is that, when learning R, it pays to keep everything as simple as it possibly can be. The skills and concepts involved in analysing these data are the same though.

R, and the techniques and statistics we teach, are used right across the natural sciences.

If you’re still not convinced — don’t worry … we do include some clinical examples, and we will be collecting our own psychological data soon enough and analysing that.

Exercise 4

  1. Open your workbook for this workshop (called session-1.rmd).
  2. Create a new chunk below the Exercise 4 instructions.
  3. Load the psydata package.
  4. Look at the fuel data using the glimpse() function.
  5. Display the fuel dataset and try out the navigation buttons.
  6. Write a line of code which makes a list of columns in the development dataset.

The output should look something like this:

The fuel dataset

Columns in the development dataset

Exercise 5

In your workbook (session-1.rmd):

  1. Create a new chunk below the Exercise 5 instructions.
  2. Load the psydata package (if you haven’t already).
  3. Show the first 10 rows of the development data using head().

Use the output to answer the following question. After entering your answer, click outside the box. The border will turn blue when the answer is correct.

The population of Afghanistan in 1967 was: .

Take a break!

Our student pilot-testers suggested that now would be a good time for a short break!

Scatterplots

  • A scatterplot shows the relationship between two continuous variables (columns)
  • Each observation (row) must have at least two values (so we need two columns)
  • These define the position of a point on the x and y axes of the plot
  • Use ggplot()
  • aes(x = ..., y = ...) chooses the x and y data columns and creates the axes
  • geom_point() adds the points
# if you have not already, load these packages
library(tidyverse)
library(psydata)

# make a scatterplot from the fuel dataset
fuel %>%
  ggplot(aes(x=weight, y=mpg)) + # selects the columns to use
    geom_point()                 # adds the points to the plot

# the same plot, this time we left out x= and y= in
# the aes code. These are implicit from the order of weight and mpg
# (the x-axis comes first)
fuel %>%
  ggplot(aes(weight, mpg)) +
    geom_point()         

A scatterplot shows the relationship between two variables by plotting their values as points on the x-axis (the left-right position) and the y-axis (up-down).

This code chunk creates a scatterplot using the fuel dataset.

  • Create chunk and type code
library(tidyverse)
library(psydata)
fuel %>%
  ggplot(aes(x = weight, y = mpg)) +
    geom_point()

  • select the pipe with the cursor to highlight

The %>% symbol is special: it’s called a ‘pipe’. We’ll cover the pipe in a later session, but for now you just need to know that it sends the fuel data on to the next line of code, as if passing it down a pipe.

The second line receives the data. The ggplot() function means we’re making a plot.

The plot is built in two steps:

The first step

  • select ggplot(aes(weight, mpg))

selects columns in our dataset to use for the x and y axes. In this case, the x-axis is weight, which is the weight of the cars in kg. And mpg is miles per gallon, or fuel efficiency. This will be the y-axis.

  • select weight in the code - highlight it is the x-axis
  • same for mpg and y-axis

We can see the plot if we run this statement using the keyboard shortcut — Ctrl or Cmd (Mac) + Enter

  • run the code and show the resulting plot
  • emphasise this is shown below the code chunk when using RMarkdown

Building plots in layers

A useful thing to know is that ggplot works by building up plots in multiple layers.

If we run just this part of the code, we can see the plot with just the axes, and no data shown.

  1. run just the first two lines of code by selecting and pressing ctrl+enter
  2. emphasise the axes are there but no data shown
  3. rerun all code to plot points

So, conceptually, we make plots by:

  1. selecting data
  2. defining the axes, and then
  3. drawing the data points

Each part of the plot is separated by a + symbol and goes on a new line.

RStudio is smart and knows all this is part of the same statement. This means it automatically indents the code.
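The idea of layers can be sketched by saving each stage of a plot under its own name. This example assumes the ggplot2 package is installed (it is loaded as part of tidyverse) and uses the built-in mtcars dataset, with its wt and mpg columns, in place of fuel. The names axes_only and with_points are made up for this illustration.

```r
# ggplot2 is one of the packages that library(tidyverse) loads
library(ggplot2)

# step 1: choose the data and define the axes (no points yet)
axes_only <- ggplot(mtcars, aes(wt, mpg))

# step 2: add a layer which draws the points
with_points <- axes_only + geom_point()

# count the layers in each version of the plot
length(axes_only$layers)
length(with_points$layers)

# display the finished plot
with_points
```

Displaying axes_only on its own would give the empty axes described above, which is a quick way to check the axes before adding any data.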

Cutting corners

There’s just one final thing to explain: In the previous code we wrote x = weight and y = mpg.

This makes things explicit, which is nice, but takes longer to type. You can also write the plot this way:

fuel %>%
  ggplot(aes(weight, mpg)) +
    geom_point()         

R assumes that the first column is the x-axis and the second is the y-axis.

  • select x-axis and y-axis in turn when describing

We normally drop the x = and y = in these guides, and you should too.
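A quick way to convince yourself that the short form means the same thing is to compare what aes() records in each case. This is a sketch that assumes ggplot2 is installed and borrows the wt and mpg column names from the built-in mtcars dataset; the object names are invented for this example.

```r
library(ggplot2)

# the explicit version, naming each axis
named <- aes(x = wt, y = mpg)

# the short version, relying on the order of the columns
positional <- aes(wt, mpg)

# both record an x and a y, in that order
names(named)
names(positional)
```

Because the order is what matters in the short form, swapping the two column names would swap the axes of the plot.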

Exercise 6

  1. Create a new chunk below the Exercise 6 instructions in your workbook.
  2. Using the fuel dataset, create a scatterplot with engine_size on the x-axis and mpg (miles per gallon, or fuel economy) on the y-axis.
  3. Run the chunk.

The scatterplot should look like this:

Working interactively

  • A statement is one or more lines of code which R knows are linked
  • Run a statement: Ctrl + ↵ (or ⌘ + ↩︎ Mac)
  • Run the whole chunk: Shift + Ctrl + ↵ (or Shift + ⌘ + ↩︎ Mac)

No code shown in this video

In an earlier video we introduced RMarkdown. In this clip, we recap on some useful things to know when using RMarkdown files in RStudio. It’s intended as a kind of reference for later on.

  • Running just part of a statement
  • Run a whole chunk using Cmd+Shift+Enter or Ctrl+Shift+Enter (Windows)
  • How R knows where statements start and end
  • The importance of blank lines
  • Loading packages first

Running parts of statements:

happy %>%
  filter(year==2018) %>%
  ggplot(aes(gdp_per_capita, happiness)) +
  geom_point()
Warning: Removed 6 rows containing missing values (geom_point).

  • Demonstrate selecting each part in turn again
  • Point out that we can check what data has been sent to the plot and amend if necessary
  • Demonstrate making an error where column is misspelled
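Running only the first part of a statement is a way to check the data before it reaches the plot. The sketch below shows the same idea by saving the filtered data under a name. It assumes the dplyr package is installed (it is loaded as part of tidyverse) and uses the built-in mtcars dataset in place of happy; the name four_cyl is made up for this illustration.

```r
# dplyr provides filter() and the %>% pipe; tidyverse loads it for you
library(dplyr)

# keep only the rows for cars with 4 cylinders
four_cyl <- mtcars %>%
  filter(cyl == 4)

# check how many rows are left, and look at the first few,
# before sending the data on to a plot
nrow(four_cyl)
head(four_cyl)
```

If the number of rows is not what you expected, it is easier to fix the filter at this stage than to puzzle over an odd-looking plot later.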

Check your knowledge

Write an answer to each of these questions in the Check your knowledge section of your workbook. The answers will be revealed in Session 2.

  1. How do you run part of a line of R code using the keyboard shortcut?
  2. Which library will you always need to load in your first R Markdown chunk?
  3. What is psydata?
  4. How would you look at/inspect a whole dataset?
  5. What does glimpse() do and when is it useful?
  6. What is the 5th column in the development dataset?
  7. Which function makes a plot? (there are many, but we mean the one shown above)
  8. Which function chooses the columns of data used in the plot?

Extension exercises

Please remember that these extension exercises are not required to pass the course. We include them because some students work through these materials much more quickly than others — perhaps because they have more previous experience with programming — and we aim to give all students the opportunity to stretch their skills.

If you do find you have extra time, however, these exercises are intended to provide additional practice in the techniques taught here, and to be useful preparation for using R independently in a stage 4 or MSc research project.

Extension exercise 1

This scatterplot uses the fuel dataset to show a vehicle’s power on the x-axis against mpg on the y-axis.

In a new chunk, write the R code to produce this plot.

Extension exercise 2

There is another built-in dataset called iris which includes data about different flower species.

Use glimpse() to get a list of the column names.

Make a scatterplot which shows the relationships between petal widths and lengths.

Congratulations!

You’ve completed all the exercises for this session, and are well on the way to working fluently in R. Well done!

Further reading

Scatterplots and visualisation: Fundamentals of Data Visualization is an excellent resource for data visualisation in R. This chapter: https://clauswilke.com/dataviz/visualizing-associations.html shows many examples of plots which display relationships between variables (including scatter plots) which would extend the material here.